The Thing Metabolome Repository family (XMRs): comparable untargeted metabolome databases for analyzing sample-specific unknown metabolites

Abstract The identification of unknown chemicals has emerged as a significant issue in untargeted metabolome analysis owing to the limited availability of purified standards for identification; this is a major bottleneck for the accumulation of reusable metabolome data in systems biology. Public resources for discovering and prioritizing the unknowns that should be subject to practical identification, as well as further detailed study of spending costs and the risks of misprediction, are lacking. As such a resource, we released databases, Food-, Plant- and Thing-Metabolome Repository (http://metabolites.in/foods, http://metabolites.in/plants, and http://metabolites.in/things, referred to as XMRs) in which the sample-specific localization of unknowns detected by liquid chromatography–mass spectrometry in a wide variety of samples can be examined, helping to discover and prioritize the unknowns. A set of application programming interfaces for the XMRs facilitates the use of metabolome data for large-scale analysis and data mining. Several applications of XMRs, including integrated metabolome and genome analyses, are presented. Expanding the concept of XMRs will accelerate the identification of unknowns and increase the discovery of new knowledge.


INTRODUCTION
A major bottleneck in systems biology is the poor availability of complete datasets of chemical compounds (metabolome) present in the samples. Using high-sensitivity and high-throughput mass spectrometry (MS) in an untargeted manner, several thousands of signals derived from the chemicals are detected simultaneously from a sample. However, it is not possible to identify most of the chemicals because of the limited availability of purified standard chemicals required for the identification. To overcome this issue, bioinformatics methods for predicting chemical structures using the mass spectra of fragmented unknown molecules have been studied eagerly and a prediction accuracy of over 70% has been observed in recent years (1,2). Furthermore, an approach for structural elucidation based on the similarity of mass spectral features to those of known compounds (molecular networking) has been provided by the GNPS consortium (3). However, even when these prediction results are available, there is still a lack of information to help prioritize unknowns from the many candidates for further detailed investigation and identification. To identify a genuinely unknown chemical, the researcher will encounter high costs, for example for purification or organic synthesis and the direct determination of its chemical structure, along with considerable risk for failed identification owing to misprediction. A lack of resources for discovering and prioritizing unknowns for further detailed investigations after their selection by statistical analysis has emerged as a long-standing practical issue in untargeted metabolome studies and is considered a bottleneck in systems biology.
A potential way to discover and prioritize unknowns is to establish a public resource that provides sample-specific localization of the unknowns. Sample metadata such as tissue specificity and taxonomic relationships significantly improves metabolite identification (4). A successful example of the identification of unknowns based on sample specificity was biomarker discovery for specific cancer cells using a database (BinBase) of gas chromatography (GC)-MS-based metabolome data obtained from various samples (5). GC-MS, which is suitable for detecting volatiles and primary metabolites, is advantageous for data comparison by the reproducibility of the fragmentation based on electron ionization and the normalized retention time (retention index). In liquid chromatography (LC)-MS-based metabolomics, which is suitable for detecting a wide variety of liquid-soluble unknowns such as plant-derived secondary metabolites, sample specificity can also be used to discover unknowns. For example, Olivon et al. (6) discovered seven bioactive natural compounds by combining information about taxonomy, bioactivity, and MS/MS spectral similarity with in-house LC-MS data from 292 plant species in Euphorbiaceae. We also previously annotated tomato metabolites by comparing untargeted LC-MS data of tomatoes with those from Arabidopsis, Medicago truncatula, and Jatropha curcas (7). However, a public resource for LC-MS-based untargeted metabolome data has not been established.
A major reason for the lack of public resources is the difficulty of comparing LC-MS data between different studies in the public repositories. A considerable amount of metabolome data obtained from various samples has been accumulated in public repositories; examples include MetaboLights (8) and Metabolomics Workbench (9). However, to the best of our knowledge, there is no report of data mining of unknowns involving extensive use of the registered data across a wide range of samples. A major reason for this is the difficulty of data comparisons between different studies. It is even difficult to judge whether two given data are comparable by checking the analytical methods (metadata) and actual accuracy/resolution of the detector when the data are measured. Searches based on mass spectra and precursor ion mass, such as those provided by MASST (10), foodMASST (11), ReDU (12) and Metabolomics Workbench (9), are powerful tools for finding samples that may contain the queried metabolite. However, the absence and thus sample-specific localization of the queried metabolite cannot be examined using their datasets consisting of mixed conditions. In the case of mass spectra, controlling the dependence on the instruments and conditions, mass spectral quality, and coverage of the metabolite peaks are still aspects to be resolved for comparison. One approach to tackle this is to develop datasets obtained by a uniform condition. The dataset of ∼3600 foods from the Global FoodOmics Project used in foodMASST would be a good example of mass spectral data (13). However, comparable datasets for precursor ion mass data, which would be more advantageous than mass spectra in terms of covering the chemical space, are scarce (7).
Here, we report a series of LC-MS-based untargeted metabolome databases as public resources for discovering and prioritizing unknown metabolites based on their sample-specific localizations as evaluated by the precursor ion mass, retention time, and MS n or MS/MS spectra. First, we released the Food Metabolome Repository (FoodMR, http://metabolites.in/foods) in 2017, which contained data from 136 food samples (14). Since then, we have expanded the sample variety for FoodMR up to 222 foods and implemented several new functions, such as search functions based on the mass spectral similarity and neutral loss. Furthermore, we developed another database for plants (Plant Metabolome Repository, PlantMR, http:// metabolites.in/plants) and have been developing a database for anything (Thing Metabolome Repository, ThingMR, http://metabolites.in/things) (hereafter, we refer to the three databases as XMRs). We used specific LC-MS instruments and conditions to ensure the data comparison between arbitrary samples. Although the detected compounds are limited by the particular methods applied, we can evaluate the sample-specific localization of an unknown by the mass value of the precursor ion, retention time, and mass spectra of product ions. It is notable that the search, download, and browsing functions of the XMRs are available as application programming interfaces (APIs) to use the data from the other computational programs. The untargeted metabolome data have now been used directly by bioinformatics tools in systems biology. In this report, we briefly introduce the statistics and principal functions of XMRs, presenting examples of their practical use and precautions, and discuss the consequences of expanding the concept of XMRs.

Untargeted metabolome analysis
The details of the data acquisition and processing of untargeted metabolome data deposited to XMRs are provided in the Supplementary Methods. Briefly, a uniform metabolite extraction method and the two LC-MS platforms for FoodMR/PlantMR and ThingMR, respectively (Table 1), were used for data acquisition. PowerGet-Batch software (https://www.kazusa.or.jp/komics/software/ PowerGetBatch) (14,15) was used for peak detection, characterization, and alignment. The parameter setting files of PowerGetBatch are available on the download page of XMRs.

Statistics for ThingMR
The peak table consists of the valid peaks detected in 535 samples in ThingMR (March 2022) was constructed using the alignment function of the PowerGetBatch software The peak table (alignment results) and parameter setting files of PowerGetBatch were available from the download page of ThingMR. The other statistics were calculated and visualized using Microsoft Excel software (Microsoft Japan Co., Ltd) with the support of in-house Java programs.

Integrated metabolome and genome analysis
The orthologous gene groups of 80 plant species with scaffold or chromosome-level genome assembly information obtained from NCBI or Ensembl (Supplemental Table S3) were constructed using OrthoFinder version 2.3.12 with default parameters (17). Then, the gene presence/absence profile (where species that retained a gene of interest in their genome were coded as 1, and species that did not have a gene of interest were coded as 0) was calculated for each orthologous gene group. Likewise, the metabolite presence/absence profile (species with a TUP of interest were coded as 1, and species without a TUP of interest were coded as 0) was calculated for 120 plant metabolome data (ESI-positive mode, Supplementary Table S4) in the aligned 535 samples. Then, in order to discover the common phylogenetic distribution patterns between genes and metabolites, an all-to-all similarity comparison of the gene presence/absence profiles and the metabolite presence/absence profiles was performed using in-house Python and R scripts.

Other methods
The detailed method for datamining for novel flavonoid candidates, statistics for MetagoLights, and identification/annotation procedures for metabolites described in the 'Application of XMRs' section are provided in the Supplementary Methods.

Development of XMRs
We first established the Food Metabolome Repository (FoodMR, http://metabolites.in/foods) in 2017, with data from food items analyzed in an untargeted manner using reverse-phase LC coupled with high-resolution MS (Table 1) (14). Foods were selected as the target samples because they contain a large variety of chemicals derived from various biological sources and processing techniques, such as mixing, cooking, and fermentation; therefore, the database could be helpful in research fields beyond food science. Since the initial report containing data from 136 foods (14), we have expanded the number of samples to 222. The details of the samples and analytical methods (metadata) are hosted by Metabolonote (16) and therefore searchable through MetabolomeXchange (http://www. metabolomexchange.org/).
We used a uniform analytical method and a data analysis procedure for all samples to ensure arbitrary data comparison and then depict sample-specific localization of the unknowns. The identity or similarity of the peaks was examined by the mass value and retention time at 5 ppm and 1% (∼1 min) tolerances, respectively. In addition, the similarity of the peaks can be examined by their multi-stage mass spectra (MS 2 and MS 3 ) obtained by data-dependent acquisition (DDA) using ion-trap MS if available.
The FoodMR provides some additional information for peak annotation. The compound database search results after searching by the measured mass values and assigned adduct ions of the peaks were available. The following compound databases were used: KEGG (18), KNApSAcK (19), HMDB (20), LIPID MAPS (21)  tation of the peak, mass chromatogram data converted to mzXML format are available. Using the software supporting the mzXML format, such as MassChroViewer (https:// www.kazusa.or.jp/komics/software/MassChroViewer) (14), users can analyze the data and annotate the unknown peaks in detail.
We developed the PlantMR (http://metabolites.in/plants) in 2020 for plant samples, including the inedible parts of plants uncategorized in foods (Table 1). FoodMR and PlantMR are compatible as essentially the same procedures are adopted, although the control of the retention time in PlantMR is not as strict as in FoodMR. Data from 28 plant-related samples, including model plants, Arabidopsis, rice, Lotus japonicas, tomato, cultured tobacco cells, Physcomitrella patents, Marchantia polymorpha, and poplar are available. Some samples contain the predicted atom numbers of nitrogen and sulfur in the chemical structure of the peaks estimated by the comparison with fully labeled plant samples using 15 N or 34 S. This information helps the metabolite annotation.
Since 2021, we have been developing the ThingMR (http://metabolites.in/things), suitable for any samples, because--as mentioned later--the compound annotation is improved synergistically by enlargement of sample va-D664 Nucleic Acids Research, 2023, Vol. 51, Database issue riety. Four bio-resource centers that joined the National Bio-Resource Project (NBRP, https://nbrp.jp/en/resourcesearch-en/) in Japan provided 132 samples from basic strains of model organisms (July 2022). We used another LC-MS condition for constructing ThingMR because we could not add new data to FoodMR and PlantMR owing to discarding the LC-MS instrument used. Therefore, the data in ThingMR are not fully compatible with those in FoodMR and PlantMR (Table 1). Although the MS 3 spectra are not available in ThingMR, the mass accuracy of the MS/MS fragmentation is higher (<20 ppm by Q-ToF MS) than in FoodMR and PlantMR (<0.5 Da by ion-trap MS). This is advantageous for compound prediction based on the mass spectra. Although the column, solvent, and gradient program applied for ThingMR differed from those applied for FoodMR and PlantMR, we can approximate the retention time of the peak based on the regression curve of the retention time of peaks commonly detected in similar samples. Therefore, we can search the peaks by specific m/z values and retention times across the three databases (see the section 'Peak search by mass value and retention time' and Figure 3).

Statistics
The statistics for the peaks characterized in the XMRs are summarized in Table 2. More than 4200 peaks were detected on average in a single sample in both ESIpositive and ESI-negative modes. Among them, 14%-21% and 7% had fragmentation mass spectral information in Food/PlantMR and ThingMR, respectively. The higher rate in Food/PlantMR might be attributed to, in part, a longer gradient time and the analytical replicates. The peak ratios assigned as [M + H] + or [M-H] − were lower in Food/PlantMR (78-83%) than in ThingMR (91%-97%). The ratios of the peaks with compound database search results were also lower in Food/PlantMR (36-46%) than in ThingMR (79%-84%). These differences are attributed to the higher mass resolution in Food/PlantMR and the consequently lower mass tolerances given for the search (Table  1). A higher rate of the peaks with FlavonoidSearch results was observed for MS 3 spectra than for MS 2 spectra in both FoodMR and PlantMR. This result is in agreement with the concept that modifications such as glycosylation contributed to the increased diversity of flavonoids (23).
An unsaturated number of peaks. We examined the increase in the number of peaks per the increase in the number of samples. Using the data from 535 samples in ThingMR (March 2022), we aligned the same or similar peaks between the samples and created a peak table. In total, 667 417 and 461 121 alignments (referred to as TUPs) assigned as [M + H] + for ESI-positive and [M-H] − for ESI-negative modes, respectively, were generated from the peak data of 524 samples except those of standard chemicals. Using the data, the number of TUPs was counted in randomly selected samples (Supplementary Figure S1). The increase in TUPs was rapid from 2 to ∼200 samples, and then occurred gradually till 400 samples, and finally became almost linear at rest. This result suggested that the variety of detectable chemicals is not saturated with 524 samples. The expected increase in unique peaks per addition of a new sample was estimated in two ways. The first is based on the regression lines calculated with the data obtained from 400-524 samples (Supplementary Figure S1). The slopes of the regression lines suggested that approximately 666 and 450 unique peaks were included in a single sample in positive and negative modes, respectively. The second is based on the distribution of TUPs commonly detected in the samples. As shown in Supplementary Figure S2, the distribution strongly followed the power laws. Therefore, using the number of TUPs detected in only a single sample (326,396 for ESI-positive and 221,996 for ESI-negative modes), we calculated the average TUPs per sample as 623 and 424 for ESI-positive and ESI-negative modes, respectively. These results were in good agreement with the first estimation.
The uniqueness of the chemical profile of the sample. To represent the uniqueness of the chemical profile, we defined averaged peak share rate (APSR) of the samples. The PSR, defined as the rate of samples where a TUP was detected to the total sample number (524) was calculated for each TUP. Then, the averaged value of the PSRs of the TUPs detected in a sample was defined as APSR. Therefore, the sample with a smaller APSR would have a larger number of sample-specific peaks (Supplementary Table S1, Figure 1). The samples with smaller APSRs included animal samples (ragworms, urine of cat and dog, etc.), bacteria (yeast), and environmental samples (water from a paddy field, etc.). These are in the less frequently analyzed sample category in ThingMR. The samples in ThingMR had various APSRs and peak numbers ( Figure 1, Supplementary Table S1). Rosemary (Lamiaceae) and Inuyomogi and Ryuno-giku (Asteraceae) had a large number of peaks and smaller APSRs, although many samples in these families are included in ThingMR. These species may biosynthesize many species-specific chemicals. When the distribution was viewed by categories, the plants were scattered widely. Foods contained a lower number of peaks and higher AP-SRs. This suggests that most of the chemicals we eat in foods are ubiquitous. The distribution shown in Figure 1 will change in future, as the number of samples increases. The samples in the category with lower APSRs should be actively analyzed to efficiently enhance the coverage of chemicals.

Functions of XMRs
This section briefly introduces the major functions of XMRs for obtaining insight into sample-specific localizations of the metabolites peaks, namely, search functions and APIs. XMRs also provide essential functions as webbased databases, such as browsing peak lists ( Figure 2) and peak details ( Figure 3) and downloading the raw and processed data files. In addition, the mass chromatogram data, presented in two-dimensional images, are available in Microsoft PowerPoint files named 'MassChroBook.' The two-dimensional pictures help to present intuitively the similarity/difference in the metabolic profiles that may be missed out by statistical methods such as multivariate analysis.

Peak search by precursor ion mass value and retention time.
Users can search peaks detected in the samples based on the precursor ions that match a given accurate mass value and a retention time of LC ( Figure 4). The search result was represented as a table in which the rectangle icons for the matched peaks are arranged in the columns corresponding to their nominal retention time. The peak icon shows the peak intensity and retention time by the color and the number, respectively. This presentation of the results table helps users to grasp the sample specificity of the queried peak.
The small rectangles at the bottom of the peak icon show the other characteristics of the peak, namely the availability of MS n spectra, the availability of database search results, the type of adduct ion assigned, and the Flavonoid-Search score. These help users to check the appropriateness of adduct ion assignment, obtain the MS n spectra when it lacks in the queried peak, and facilitate further annotation of the peak.
In the case of the peaks registered in XMRs, users can directly perform the search using the peak information by two procedures. One is selecting the row of the peak on the peak list of the sample and clicking the 'Search similar peak' button ( Figure 2). The other is clicking the 'Search similar peak' button on the detailed peak information page ( Figure  3). From the peak information page, users can search for similar peaks in the other XMR databases based on the m/z value and the approximated retention time (Figure 3). More generally, as described below, in the case of the peaks measured by other LC-high resolution MS platforms or compounds of known formula, the users can obtain the potential counterparts in XMRs by not specifying the retention time in the search. The results show the candidate isomers and their sample specificity. The number of possible isomers and differences in their MS n spectra will also help the prioritization and structural annotation of the unknown peak. Furthermore, users can identify the counterpart peak in XMRs if similar LC-MS conditions were applied in the user's platform by following these steps: analyzing the same or similar sample using the user's platform; identifying the commonly detected peaks in the platform and XMR; and calculating the equations to convert the retention time from one to the other based on, for example, the regression curve for the retention times of commonly detected peaks. As an example, we present the construction of a converter for the LC-MS systems used in a study in MetaboLights (MTBLS771) (24) (Supplementary Data 1). We have also provided an MS Excel file for the conversion between ThingMR and FoodMR/PlantMR on the download page. Therefore, we can generally use the peak information in XMRs as a reference for annotating unknown peaks detected by a wide range of LC-MS instruments.
The following given tolerances for mass value and retention time are recommended for searching: mass value, 5 ppm (FoodMR and PlantMR) and 20 ppm (ThingMR); and retention time, 1 min (FoodMR), 2 min (PlantMR) and 0.5 min (ThingMR) ( Table 1). The actual mass accuracy of FT-ICR MS used in the FoodMR and PlantMR is less than 2 ppm in most cases. However, some peaks with higher intensity showed mass drift up to 5 ppm. Therefore, we recommend giving 5 ppm mass tolerances for arbitrary peak comparisons for FoodMR and PlantMR. The actual mass D666 Nucleic Acids Research, 2023, Vol. 51, Database issue  Figure 1, the number represents a sample specificity measure of the peak and facilitates searching sample (group)-specific peaks. accuracy of Q-ToF used in the ThingMR is below 5 ppm in most cases. However, in some cases, such as peaks with lower intensity, the accuracy is not within 15 ppm. Therefore, we recommend allowing 20 ppm for ThingMR. The reproducibility of the retention time is usually <1% of the gradient time. Users can check the reproducibility using the standard chemical peaks available on the 'Compound' page in ThingMR or the internal standard peaks in the Mass-ChroBook or mass chromatogram data.
Peak search by mass spectra. The cross-sample distribution of the peaks with similar mass spectra can be exam-ined using the search functions based on the mass spectral similarity ( Figure 5). Because DDA is applied for MS n or MS/MS analysis in XMRs, fragmentation data are available for only a proportion of the peaks. Nevertheless, we can obtain the sample-specific localization of the peak by combining the mass spectral search results with the precursor search function mentioned above. This is one of the most significant differences between XMRs from the other databases solely based on mass spectra. By clicking the 'Search spectra' button on the detailed peak information page, users can immediately perform the search using the spectrum (Figure 3).   The search results are provided along with images of mass spectra to allow intuitive judgment of the identity/similarity of the spectra ( Figure 5). The mass spectral similarity is estimated by the cosine product correlation coefficient. Because the similarity score depends on the number of product ions, we cannot determine a specific threshold value to judge the identity of the metabolites. Therefore, the XMRs provide images (bar graphs) of mass spectra in the results table ( Figure 5). In the calculation of the similarity score, the mass spectral data were rounded to nominal mass values. The rounding was adopted because of the low mass accuracy of ion-trap MS (<0.5 Da) used in FoodMR and PlantMR and to accelerate the search rate.
It is unique for FoodMR and PlantMR that the MS 3 spectra of the unknowns obtained from untargeted metabolome analysis are searchable among a wide vari-D670 Nucleic Acids Research, 2023, Vol. 51, Database issue ety of samples. We can search for candidates of compound derivatives that display the same fragmentation in MS 3 as in the MS 2 spectrum of the compound. This helps, for example, achieve a comprehensive search for potential glycosides of a known or unknown aglycone.
As a precaution for general use, please note that the compatibility of the mass spectra between platforms should be considered for a proper understanding of the search results. The mass spectrum of a compound obtained by linear-iontrap MS in FoodMR and PlantMR differs from that obtained by Q-Tof MS in ThingMR. Similarly, the spectra in XMRs differ from the users' own data. Therefore, when the cross-sample distribution of the peaks that have similar mass spectra is evaluated, users should seek a counterpart peak in the XMRs first and then, if found, perform the spectral search based on the counterpart peak. The mass spectral search of XMRs did not aim at searching for the own mass spectra of the users by themselves, as provided by MASST (10) and the mass spectral library MassBank (25).
APIs. It is a significant feature of XMRs that the untargeted metabolome data obtained from a wide range of samples are ready to use for bioinformatics via application programming interfaces (APIs). XMRs provide APIs in representational state transfer (REST) format for most of all the available functions on web browsers. Therefore, the users directly use these functions from the external computational programs. The APIs allow the users to perform the automatic and massive search of a large number of data and subsequent complex data analysis, which is not practical to perform only in web browsers. The search results of the APIs are available in JavaScript Object Notation (JSON) format, which is suitable for processing with computational programming languages such as Python. Sample program codes for the use of the APIs written in Python are available on the help page. Sample codes for searching candidates of novel flavonoids (described later) and searching peaks specifically accumulated in certain specific sample groups (e.g. helpful for biomarker discovery) have been currently provided (July 2022). A URL-based RESTful format as the API input is also advantageous for easy and precise recording and sharing of specific information in XMRs, search conditions, and so on. Examples are shown in the column 'Link to the peak information in FoodMR' in Table 3.

Application of XMRs
An integrated analysis for discovery of metabolite and gene candidates in species-specific metabolic pathways. Using the peak table constructed from 535 samples in ThingMR (March 2022) and the published genome data, we calculated the pairs of metabolite peaks and orthologs that are detected in specific biological classes ( Figure 6). We found a metabolite peak (m/z 207.057 in positive mode, RT 10.67) that was specifically detected in Brassica oleracea, Brassica rapa, Capsella bursa-pastoris and Raphanus sativus, and not detected in other plant species. A candidate for this peak was an intermediate metabolite in the glucosinolate biosynthesis pathway, 3-indolylmethylthiohydroximate. We found 905 orthogoups that showed the same plant species-specificity; among them, four orthogroups were indole glucosinolate O-methyltransferases. The results suggest that this metabolome-genome integrated analysis facilitated the discovery of reasonable metabolite and gene pairs in species-specific metabolic pathways. Using the same approach, we found a pair of putative isomers of cucurbitacin S (m/z 481.29 in positive mode, RT 22.0 and 18.44) and an ortholog cytochrome P450 89A2 that were detected in Cucurbitaceae species, Citrullus lanatus, Cucumis melo and Cucumis sativus, but not in other plants. The detailed metabolic pathway for the biosynthesis of cucurbitacin S and the specific substrate of the cytochrome P450 89A2 are not fully understood. Therefore, the results provide a new working hypothesis for further identifying candidate metabolites and studying this pathway in depth. The precise annotation procedures for these metabolites are described in the Supplementary Methods. Briefly, the candidates were annotated using precursor ion mass and MS/MS spectra at the confidence level of 3 (putatively characterized compound classes) proposed by the Metabolomics Standards Initiative (MSI) (26,27).

Annotation of carpaine-related metabolites in papaya.
The sample-specific localization of peaks in FoodMR supported annotation of the carpaine-related biosynthetic and/or degradation intermediates in papaya (28). Carpaine, found in papaya, is an alkaloid with antiviral and antiplasmodial activities (29), but the biosynthetic pathway of carpaine is not fully understood. Hiraga et al. (28) annotated eight carpaine-related metabolites, including carpaine, carpaic acid, and three novel putative structures in papaya fruits, based on their accurate mass values and MS/MS spectra obtained by LC-MS. They did not identify these metabolites, probably owing to the limited availability of their authentic standards. However, they showed strong support for the annotation using FoodMR in which untargeted metabolome data from mature papaya fruits are registered. The mass values of the eight carpaine-related metabolites, except two abundantly accumulated in immature fruits, were detected in the mature papaya fruits in FoodMR. Furthermore, the mass values of these were not detected in other 221 foods. As carpaine was found in specific plants, papaya and Azima tetracantha (30), these results strongly suggested that the annotated metabolites were carpaine derivatives. Using the APIs of FoodMR, we were able to find other candidate carpaine derivatives specific to papaya, e.g. a putative derivative of dehydrocarpamic acid (http: //metabolites.in/foods/peak/07109/pos/2567) with the same MS 3 spectrum as that of dehydrocarpamic acid. We annotated the candidate using precursor ion mass and MS 2 and MS 3 spectra at the MSI confidence level of 3 (putatively characterized compound classes) (Supplementary Methods). These candidates would be prioritized for further identification and functional estimation.
Identification of okaramines in the rhizosphere of hairy vetch. The accumulation of food metabolome data led us to the identification of non-food-derived compounds, okaramines. We identified pesticidal okaramines for the first time from nature (31). Okaramines were first identified in soybean pulp (okara) inoculated with Penicillium simplicis-Nucleic Acids Research, 2023, Vol. 51, Database issue D671  simum AK-40 (32) and were found to have insecticidal activity same as that of ivermectin (33). However, the presence of okaramines in nature is unknown. We found several candidates of okaramine species in the rhizosphere soil of the manure plant hairy vetch (Vicia villosa Rotch subsp. villosa) in a metabolome analysis of soil samples. The MS n spectra showed good agreement with their chemical structures. In addition, we found that no peaks matched the okaramines of the 969 000 peaks from 222 foods (ESIpositive mode) registered in FoodMR. The absence in general foods was reasonable when assuming the candidate peaks were okaramines that were probably derived from soil bacteria in the rhizosphere. This was a strong driver for us to identify okaramines using authentic standards. Finally, we identified okaramine A, B and C at the MSI confidence level of 1 (identified compounds) by the accurate precursor ion mass, retention time, and MS 2 and MS/MS spectra (Supplementary Methods).
Discovery of novel flavonoid candidates. By data mining using the APIs of XMR, we found several candidates for novel flavonoids in this study. The peaks in FoodMR that matched the following conditions are potential candidates of novel flavonoid derivatives: (i) no results in the compound database search; (ii) MS 3 spectra and (iii) a high similarity score from FlavonoidSearch for MS 3 spectrum (see Supplementary Methods). A search with multiple conditions like this case is not practical for manual operation in a web browser. Using APIs, we can perform complex searches easily and within a short time. Through a search of approximately 969 000 peaks in FoodMR (ESI-positive mode), we found 23 peaks that matched the conditions above in 10 min. Then, we manually checked the sample specificity and features of the putative substituents using the FoodMR website. We found some interesting candidates, such as those detected in specific foods and those with neutral loss masses that have not been well reported previously (Table 3). Although the candidates were not identified using authentic standards (annotated at the MSI confidence level of 3 'putatively characterized compound classes,' see Supplementary Methods), they might be targets in a future study. The Python program code used for this search is available on the help page of FoodMR as an example of the use of APIs. During this investigation, we noticed that sample specificity in non-food data helped to annotate the peaks in foods. Among the 23 peaks, a single candidate was hard to annotate by its sample specificity in FoodMR. The peak with m/z 587.07 was specifically detected in the food category 'Sugars and sweeteners' and foods in 'Seasonings and spices,' which includes sugars. The peak was present in brown sugar lumps and 'Wasanbonto' (traditional noncentrifugal sugar), but absent in granulated sugar. The key to annotating the situation was determined by PlantMR. We found many peaks in rice leaves with the same MS 3 spectra as that of the unannotated peak. This information reminded us that both brown sugar lump and 'Wasanbonto' are made from sugar cane. Therefore, the peak in the sugars might be a flavonoid with an aglycone actively biosynthesized in Poaceae.
Detection of caffeine in honey. We found a peak annotated as caffeine--a plant-derived alkaloid established to accumulate in coffee and tea--was also detected in honey in FoodMR. We identified the peak in ThingMR as caffeine using an authentic standard compound at the MSI confidence level of 1 (identified compounds) based on the identical accurate mass value of the precursor ion, the retention time confirmed by co-injection, and MS/MS spectra (Supplementary Methods). We found that the presence of caf-feine depends on the honey products; namely four out of seven honey products contained caffeine. This observation was in good agreement with previous reports. Some citrus plants accumulate caffeine in the flowers and nectar, and caffeine can be found in honey (34). Moreover, rewarding honeybees with caffeine enhanced their memory of the floral scent (35). Therefore, caffeine production in the flower is understood as a strategy for increasing reproductive benefits by enhancing the pollinator's fidelity (35). Although this case was a rediscovery of previous knowledge on the known compound caffeine, is also showed that the comparison of the large variety of samples with XMRs could lead to the construction of new and proper working hypotheses without any prerequisite knowledge in a specific research field.

Precautions for data interpretation and use
We should draw users' attention to precautions to avoid mis-and over-interpretation of XMR results, although some of them are mentioned elsewhere in this study. In addition, when using the XMR results for further establishment of working hypotheses and investigation, it is always necessary for users to confirm the results. First, the sample specificity results from the precursor ion mass search depend on the samples deposited at the time of the search. Please consult the kind of samples (presence or absence) using the sample list for appropriate interpretation. Moreover, sample specificity results based solely on the mass spectral search are further dependent on the coverage of mass spectra obtained through DDA. Combined use of the precursor search would be required for proper interpretation (see the section 'Peak search by mass spectra'). Second, peak quality should be considered. Please note that a significant number of metabolites are undetectable under the LC-MS conditions used in XMRs. For example, as shown on the 'Compound' page, only 6 out of 20 amino acids were separated, ionized, and detected in ThingMR. Caffeine was only detected by the ESI-positive mode. The unique 'not detected' information for examined authentic standards on the 'Compound' page would help users in speculating the detectable metabolites based on their hydrophobicity, mass values, etc. The similarity of the precursor ion mass, retention time, and MS/MS spectra to those of authentic standards strongly suggests that the peak is the standard compound. However, the peak could still not be the standard but a rather similar isomer that cannot be separated and distinguished under our conditions. At the very least, our system cannot distinguish between most of stereoisomers. Third, peak quantity (intensity) is neither an absolute nor accurate value because it depends on various factors such as the amount of sample injected, metabolite extraction and ionization efficiency, ion suppression or enhancement, detector sensitivity at the time of analysis, and peak signal distribution (for the log-transformed intensity centered by the median). Finally, more general remarks should be made. The peak data in XMRs contain potential false positives and false negatives. Failures in estimating precursor ion mass could also occur as a result of monoisotopic ion peak mischaracterization. This case is more likely to occur in the case of a higher mass value and a multivalent peak with a low-signal intensity for the monoisotopic ion. Misestimation of adduct ions is also possible. The alignment results on the download page may contain misalignments. The database search results and FlavonoidSearch results only provide candidates and do not guarantee the presence or absence of the compounds. These possibilities must be carefully examined during further investigations, especially for peaks annotated solely by precursor ion mass values.

DISCUSSION
Comparable untargeted metabolome data, which have been lacking in systems biology and are now provided by XMRs, are a good resource for further data-driven research into unknown chemicals based on their sample-specific localizations. As exemplified in the integrated analysis of the metabolome and the genome, the data resource facilitates the use of metabolome data in multi-omics studies. In the example of papaya and okaramines, FoodMR provided appropriateness for the presence of both the unknown unknowns (truly novel compounds) and known unknowns (known compounds not described in the sample) (36), respectively, in the specific samples. Using caffeine as an example, the comparison of the various samples that are not compared in a general design of a single study would find an unexpected occurrence of metabolites that can lead to new working hypotheses. In the example of the data mining for novel flavonoids, the APIs of XMRs were used as a powerful bioinformatics tool for the top-down discovery of unknowns based on their sample specificity. Although the sample specificity-based discovery of unknown metabolites has been reported using GC-MS-based database (5), as far as we know, XMRs is the first public databases in which LC-MS-based untargeted metabolome data are comparable based on the precursor ion mass, retention time, and MS n or MS/MS spectra in total 984 various samples. The provision of such datasets will strongly promote further use of the metabolome data, for example, correlation analysis of metabolites and genes and the discovery of unknown metabolites for quality control markers of specific organisms, among others.
Information on the precursor ion rather than the MS/MS or MS n spectra is useful in a public resource for depicting the sample-specific localization of the metabolites. MS/MS or MS n spectra are often used for comparisons of structural identity and the prediction of the structures of the metabolites (3,37). However, the data are dependent on the instruments and collision energy conditions; hence, the interpretation and confidence of the results remain under discussion. In contrast, the accurate mass value of the precursor ions is robustly obtained with commonly used highresolution MS. As exemplified here, the papaya-specific localization of the carpaine-related candidates was successfully examined based on the accurate mass values obtained by a different MS platform from that used in FoodMR (28). Thus, the use of precursor ions is advantageous for enlarging the comparison spaces of the metabolites. We also previously demonstrated the usefulness of the accurate mass records (AMRs) for the annotation of unknown metabolites and pointed out the limited availability of AMRs in the public domain as an issue for promoting the annotation (7). XMRs provide more than 11 107 619 AMRs obtained D674 Nucleic Acids Research, 2023, Vol. 51, Database issue from 984 samples in total ( Table 2, July 2022). The availability of these data on the XMR website and via APIs should promote the discovery, annotation, and identification of unknowns.
To assist further valuable knowledge discoveries, the concept of XMR should be expanded, and some issues should be solved, as discussed below.

Expansion of the concept of XMR
The concept of XMR should be expanded through the following four aspects.
Variety of the samples. The variety of the samples should be increased. As exemplified in the discovery of novel flavonoids, the non-food data obtained from plant samples helped to annotate the unknown metabolites found in foods. Conversely, the okaramine cases showed that food data helped to identify the metabolites derived from nonfood samples. These results obtained by the previously unexpected specific comparisons suggest that increasing the sample varieties synergistically accelerates the annotation of the unknowns based on their sample-specific localizations. Therefore, not only the samples in specific categories (such as foods and plants) should be measured, but also any other samples (e.g. from animals, bacteria, the environment, waste, artificial products, historical samples) can be added into a dataset for comparison. Consequently, we are now expanding ThingMR. As demonstrated in Supplementary Figure S1, the unique peaks have not yet been saturated. The index APSR (Figure 1) we proposed in this article would facilitate selecting the category of the unanalyzed samples.
Coverage of the metabolites. The coverage of metabolites should also be expanded by the addition of several robust instrumental platforms. Only a single reversed-phase LC method was used for the construction of XMRs. Therefore, a large portion of the metabolites, such as highly hydrophilic compounds (e.g. sugars, organic acids and most amino acids), highly hydrophobic compounds (e.g. carotenoids and non-polar lipids), and volatiles (e.g. low-molecular terpenes, aldehydes, and alcohols) are omitted and not compared in the current XMRs. The use of other separation technologies for constructing them, such as lipid-focused LC-MS or supercritical fluid chromatography-MS, hydrophilic interaction chromatography (HILIC)-MS for hydrophilic compounds, GC-MS for volatiles, and CE-MS for ionic water-soluble compounds, and the establishment of robust measurement procedures for arbitrary comparison of a large number of samples are required. Also, the mass accuracy and coverage of MS/MS spectra should be improved using high-spec MS instruments. Some widely targeted approaches will facilitate the annotation of the data. Furthermore, any other compound detection technologies, such as chemical sensors, may be applicable as long as they are reproducible and robust for large-scale comparison. The application of multiple platforms to the same sample set should strongly enhance the annotation and discovery of novel metabolites. For example, the combined use of LC-MS and GC-MS for various samples would be helpful in discovering the volatiles and their watersoluble glycosides as storage forms by correlation analysis of their sample specificity.
Number of comparable datasets. Not only our XMRs, but also other comparable datasets should be constructed in different countries. For the efficient enlargement of the abovementioned points (sample variety and metabolite coverage), the specification of a base center that can actively produce high-quality and comparable data is ideal, for two main reasons. First, a general metabolome analysis service where samples are provided by the researchers is not suitable because the sample diversity is affected by researchers' interests, causing it to be biased toward some specific samples, such as humans and model organisms. For example, out of 128 156 data entries measured by reversed-phase LC-MS published from MetaboLights (January 2022), 50.9%, and 16.1% were samples from humans and perennial ryegrass, respectively; only 767 unique organism parts were present. The active collection and analysis of unanalyzed samples are required to enlarge the sample variety efficiently. Second, a general analysis service/center where the analysis methods are customized according to the users' requests is not suitable to maintain sufficient data quality. The restricted use of the analytical instruments in constant conditions with a single or several specific method(s) is required for the production of robust and comparable untargeted metabolome data. This machine use policy is especially important for producing high-quality data using multiple platforms for expanding metabolite coverage. However, it is not practical to establish a centralized base that analyses all 'things' from all over the world owing to the maintenance costs, throughput bottlenecks, and need for international shipping of the materials. Therefore, multiple datasets in which the sample data are comparable in the dataset should be constructed by individual base centers established in various countries. As exemplified here, even a single dataset contributes to knowledge discovery. Furthermore, some datasets can be compared based on the similarity of the analytical methods, as demonstrated in the comparisons between FoodMR and ThingMR or data from MetaboLights and XMRs (Supplementary Data 1). Data obtained from a common sample readily available in every country, such as a major crop, could be used to link and standardize the datasets.
Bioinformatic researchers. An increase in the number of bioinformatic researchers who discover new knowledge and working hypotheses and add value to the database is essential for expanding the concept of XMR. We expect that the number of bioinformatic researchers who are interested in XMRs will increase as the abovementioned points occurs. The provision of APIs, which enables the integration of metabolome data to other studies, such as genome, transcriptome, proteome, and phenome data, is critical for the bioinformatic use of XMR. For this purpose, further enrichment of sample metadata, such as biological taxonomy, treatment, and processing, as well as provision of the metadata described with proper ontology and controlled vocabulary as a machine-readable format should be promoted in the future (see the next section). Broadening the research Nucleic Acids Research, 2023, Vol. 51, Database issue D675 backgrounds of these bioinformatic researchers will help to achieve a multifaceted analysis of untargeted metabolome data and lead to discoveries of unexpected, new knowledge and hypotheses.

Further improvements
Enrichment of sample metadata. For further data analysis based on metadata, such as that performed with ReDU (12) and foodMASST (11,13), and for appropriate interpretation of the results, the sample metadata needs to be enriched. In FoodMR, we adopted the sample names, sample IDs, and category names that were defined in the Standard Tables of Food Composition in Japan-2015 (Seventh Revised Version). For PlantMR and ThingMR, however, we assigned simple category names that we defined ourselves and did not actively work on assigning ontology and using controlled vocabulary, although the taxonomy IDs from NCBI were available for the samples in ThingMR. This policy is because of two reasons. First, the required ontology depends on the users' purposes. The use of ThingMR is diverse because it contains data from various samples. Therefore, enrichment of the metadata would begin when the specific target is focused on by the users' community in particular study fields. Second, the attachment of ontology would be a bottleneck in data publication. Attaching proper ontology and controlled vocabulary to the sample metadata requires a deep understanding of the definition of the classes and vocabulary. Improper attachment and bias of the completeness of the work among the samples could lead to misinterpretation of the results. For these reasons, we provide simple category information, and currently, the annotation of metadata is left to the user. However, the next enhancement in XMRs would be the attachment of rich metadata like the ReDU project of GNPS (12). Any support from the research community for assigning appropriate and confidential metadata is helpful. We would certainly bridge questions from the users about the metadata of particular samples to the person responsible for those samples.
Confidence in the peak data. Confidence in the peak data should be improved, especially at the qualitative level. XMR contains many latent false-positive and false-negative peaks arising from both the analytical method and data processing. Most of the false positives are possibly derived from the data processing. In XMR, we adjusted the peak detection parameters of the processing to pick as many peaks as possible with lower intensities, which improves the assignment of adduct ions and the discrimination of isotopic peaks. However, although this increases the false-positive rate, this is not considered a major drawback to the estimation of the sample specificity of the unknown peaks because the sample specificity should also be considered from the peak intensities. In contrast, a major cause of the false negatives is considered to be the analytical method, namely, the application of a single method for metabolite extraction and the limitation of the injected sample amount on the LC to avoid carryovers. This issue can likely be overcome by the application of additional analytical procedures with different extraction methods, as mentioned in Expansion of metabolite coverage. Future increases in the analytical rep-etition of similar samples will contribute to minimizing the false positives and false negatives.
The annotation of the artificial peaks and provision of such information should be improved in the future. It is known that a significant number of peaks detected by untargeted LC-MS are those of derivative ions derived from the same compounds, such as various adduct ions, multimers, and in-source fragments; thus, the real number of unique compounds in a sample is much smaller than the observed number of peak (38)(39)(40). In the current XMR, we provide all peak data, including such derivative ion peaks, because their information may, in some cases, help with metabolite annotation. For example, in the tomato data in ThingMR, a peak of tomatidine is detected as derivative ion peaks as an in-source dissociation of a glycoside, putative tomatine. As the free form of tomatidine is not detected in the same sample, the fragmented peak helped to annotate tomatine based on the MS/MS spectra. In the current XMRs, the variations in the adduct ions and the dissociations of well-known small molecules, such as water, are provided on the peak details page (Figure 3). Possible in-source fragment peaks should be provided in future improvements.
There is no efficient way to avoid the falsenegative/positive peaks or under-/overestimations of the peak intensity caused by ion suppression/enhancement by the co-eluted compound while using the current LC-MS conditions. This issue could be overcome by the application of other ionization principles; one example would be probe electrospray ionization (PESI), which is known to be tolerant to ion suppression (41).
Quantitativity of the metabolites should be considered in the future. For the current purpose of XMR, the quantitation accuracy of the detected peaks is not crucial for the discovery and prioritizing the unknowns because the samplespecific accumulations should be evaluated in the subsequent detailed investigations. However, the addition of precisely identified and quantified data definitely improves the value of data mining through the datasets. Therefore, the application of (widely-) targeted metabolome data should be added in the future.
Identification of the unknowns. The acceleration of metabolite identification is the ultimate issue in the study of unknown metabolites. After focusing on the unknowns for further detailed investigation using XMRs, they must be identified. The sample-specific distribution provided by XMRs is helpful in determining the samples that abundantly accumulate the unknown metabolites for purification and structural elucidation. When combined with recently progressing technologies such as the crystalline sponge method (42), microcrystal electron diffraction (Mi-croED) (43,44), and a synthesis-based reverse-metabolomic approach (45), it is a prospective strategy for accelerating the identification of unknown metabolites.